ABSTRACT


The  maximum  likelihood  (ML)  method  has  been  used  by  Itakura  and 
Saito  [1]  to  derive  a  nonlinear  spectral  matching  criterion  for  estimating 
the  spectral  parameters  of  autoregressive  (AR)  processes.  In  this  paper 
it  is  shown  that  their  spectral  matching  criterion  is  a  general  property 
of  ML  spectral  estimation  in  that  it  is  valid  for  any  spectral  model  and 
applies  to  aperiodic  and  periodic  random  processes. 

An  exact  solution  to  the  ML  parameter  estimation  problem  for  AR 
processes  has  recently  been  derived  by  Kay  [2].  These  results  are  cast 
in  a  frequency  domain  formulation  which  is  used  to  generalize  the  exact 
solution  to  periodic  processes.  It  is  then  shown  that  if  the  number  of 
independent  power  measurements,  N,  greatly  exceeds  the  model  order,  M, 
then  the  ML  algorithm  reduces  to  a  pitch-directed  frequency  domain 
version  of  Linear  Predictive  (LP)  spectral  analysis. 

The  exact  solution  is  then  used  to  determine  the  spectral  envelope 
for  voiced  (periodic)  and  unvoiced  (aperiodic)  speech  and  it  is  observed 
that  the  exact  analysis  results  in  fits  that  broaden  the  formant  bandwidths 
while  reducing  the  formant  amplitudes.  A  real-time  vocoder  was  developed 
and  it  was  found  that  in  contrast  to  a  standard  LPC  algorithm  the  exact 
I. INTRODUCTION AND SUMMARY


Itakura  and  Saito  [1]  have  shown  that  spectral  envelope  estimation 
using  linear  predictive  coding  techniques  (LPC)  has  a  more  fundamental 
theoretical  basis  in  maximum  likelihood  (ML)  estimation.  Furthermore, 
they  have  used  this  theory  to  develop  a  spectral  matching  interpretation 
in  terms  of  the  Itakura-Saito  criterion.  Their  basic  mathematical  model 
dealt  with  speech  waveforms  that  were  sample  functions  of  an  autoregressive 
random  process  for  which  the  spectrum  was  not  harmonic.  While  this  is 
the  correct  model  for  the  class  of  unvoiced  sounds,  one  wonders  if 
perhaps the results are valid for voiced speech sounds as well, since in
this  case  the  waveforms  are  periodic  with  spectra  having  distinct  harmonic 
line  structure.  This  is  the  problem  addressed  in  this  paper. 

In  setting  up  the  formalism  for  the  application  of  the  ML  method  to 
both  aperiodic  and  periodic  processes,  it  was  not  necessary  to  impose 
the  all-pole  constraint  on  the  model  spectrum.  The  ensuing  analysis  led 
to  a  spectral  matching  criterion  identical  to  that  obtained  by  Itakura 
and  Saito,  which  shows  that  the  criterion  is  a  fundamental  property  of 
the  maximum  likelihood  method.  Since  this  is  a  deterministic  function 
of  the  measured  and  model  spectra,  the  underlying  Gaussian  statistical 
model  is  no  longer  a  significant  limitation  to  the  ultimate  validity  of 
the  results  and,  for  all  intents  and  purposes,  can  be  ignored.  Furthermore, 
this  interpretation  shows  that  in  the  periodic  case,  the  model  spectrum 
is  fitted  to  the  power  measurements  at  the  harmonic  frequencies. 

In a recent paper Kay [2] obtained an exact expression for the ML
estimates of the parameters of an autoregressive process using a covariance
domain analysis. By modifying Kay's analysis to reflect spectral domain
properties, the ML formalism derived in this paper could be used to
obtain an exact solution for the parameters of an all-pole spectral
envelope for aperiodic and periodic processes. It was shown that if the
number of independent power measurements, N (the number of pitch harmonics
in the periodic case), greatly exceeds the model order, M, then the ML
algorithm reduces to a pitch-directed frequency domain version of
linear prediction (LP) spectral analysis. In this case the fundamental
measurement set is not a set of correlation coefficients, but a set of
power spectrum measurements.

The  exact  ML  solution  was  then  used  to  determine  the  spectral 
envelope  for  voiced  (periodic)  and  unvoiced  (aperiodic)  speech.  It  was 
observed  that  for  voiced  speech  the  exact  method  led  to  a  "less  faithful" 
reproduction  of  the  spectral  measurements  and  resulted  in  broadened 
formant  bandwidths  and  reduced  formant  amplitudes.  During  unvoiced 
speech  the  spectral  fits  were  in  general  agreement,  which  is  consistent 
with  the  theoretical  results.  In  order  to  determine  whether  or  not 
these  differences  were  perceptually  significant  a  real-time  analysis/synthesis 
system  was  developed.  It  was  found  that  the  synthetic  speech  produced 
by  the  exact  ML  algorithm  had  the  quality  of  being  too  heavily  smoothed. 
Although  this  had  the  effect  of  eliminating  small  spectral  distortions 
which  occasionally  occurred  in  the  approximate  (N>>M)  analysis,  the 
approximate  system  produced  synthetic  speech  which  was  more  natural. 
Furthermore,  when  compared  with  a  standard  autocorrelation  based  LPC 
vocoder  using  the  same  acoustic  tube  synthesizer,  the  pitch-directed 
frequency  based  approximate  ML  algorithm  did  not  result  in  synthetic 
speech that was significantly better either in quality or intelligibility.

Based  on  the  experience  to  date,  it  appears  that  the  major  benefit 
of  the  exact  ML  analysis  for  aperiodic  and  periodic  processes  having 
all-pole spectral envelopes is the fact that it leads to a unified
theoretical  formulation  for  analyzing  voiced  and  unvoiced  speech.  It 
turns  out,  however,  that  the  frequency  domain  implementation  of  the 
spectral  analysis  algorithm  has  been  found  to  be  particularly  useful  in 
the  development  of  low  rate  systems  [3],  and  higher  quality  split-band 
vocoder  algorithms. 

II.  THEORETICAL  FOUNDATION 

In  order  to  provide  a  common  theoretical  framework  for  estimating 
the  spectra  for  voiced  (periodic)  and  unvoiced  (aperiodic)  speech,  it  is 
useful  to  model  the  speech  waveform  S(n)  as  a  sample  function  of  a 
suitably  defined  discrete  time  random  process.  A  key  step  in  establishing 
the  analytical  framework  is  to  expand  S(n)  in  terms  of  a  set  of  basis 
functions  in  such  a  way  that  the  expansion  coefficients  are  uncorrelated 
random  variables.  For  unvoiced  speech,  the  theory  requires  that  the 
sample  functions  be  defined  on  a  finite  interval  N  points  long.  A 
particularly  important  set  of  basis  functions  are  the  complex  exponentials, 
for  which  the  series  expansion  is 
$$ S(n) = \sum_{k=0}^{N-1} X_k \exp(j n \omega_k) \tag{1} $$

where $\omega_k = 2\pi k/N$ and

$$ X_k = \frac{1}{N} \sum_{n=0}^{N-1} S(n) \exp(-j n \omega_k) \tag{2} $$

It  is  useful  to  determine  the  conditions  under  which  the  complex  exponentials 
can  be  considered  to  be  eigenfunctions  of  the  covariance  matrix  R(n,m)  = 
E[S(n)S(m)].  This  occurs  when 

$$ \lambda_k \exp(j n \omega_k) = \sum_{m=0}^{N-1} R(n,m) \exp(j m \omega_k) \tag{3} $$

for some value of $\lambda_k$. If the unvoiced speech process is wide sense
stationary then

$$ R(n,m) = R(n-m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} P(\omega) \exp[j\omega(n-m)] \, d\omega \tag{4} $$


where  P(w)  represents  the  power  spectral  density.  Substituting  (4)  into 
(3)  leads  to  the  eigenvalue  equation 

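$$ \lambda_k \exp(j n \omega_k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} P(\omega) \exp(j n \omega) \, g(\omega_k - \omega) \, d\omega \tag{5} $$

where

$$ g(\theta) = \sum_{m=0}^{N-1} \exp(j m \theta) = \exp[j(N-1)\theta/2] \, \frac{\sin(N\theta/2)}{\sin(\theta/2)} \tag{6} $$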


If the frame length N is large enough that the power spectral density
changes slowly in a frequency increment 1/N, then relative to the function
$P(\omega)$, $g(\omega_k - \omega)$ can be considered to be an impulse at
$\omega = \omega_k$. Under these conditions (5) reduces to

$$ \lambda_k \exp(j n \omega_k) \approx P(\omega_k) \exp(j n \omega_k) \tag{7} $$

which shows that the complex exponentials are valid eigenfunctions of
R(n,m). The associated eigenvalues are

$$ \lambda_k = P(\omega_k) \tag{8} $$

Finally, using (2) and (3) it is easy to show that the correlation
between expansion coefficients is given by

$$ E(X_k X_l^*) = \frac{1}{N} P(\omega_k) \, \delta_{kl} \tag{9} $$

the desired result.

For voiced speech the analysis presumes that the random process
S(n) is periodic where N now represents the period. This means that the
covariance function $R(m) = E[S(n) S^*(n+m)]$ is periodic with period N.
Although  this  model  has  no  relevance  to  the  physiological  mechanism  by 
which  voiced  sounds  are  generated,  mathematically  it  can  be  used  to 
generate  a  class  of  spectra  that  have  roughly  the  same  properties  as 
voiced  speech  spectra,  and  hence  it  was  adopted  as  being  suitable  for 
the  type  of  analysis  to  be  undertaken.  Now  if  the  covariance  function 
is  periodic,  then  R(m)  =  R(m+N)  for  all  m,  and  as  a  consequence  can  be 
expanded  as 

$$ R(m) = \sum_{k=0}^{N-1} P_k \exp(j k \omega_m) \tag{10} $$

where now $\omega_m = 2\pi m/N$, and where

$$ P_k = \frac{1}{N} \sum_{m=0}^{N-1} R(m) \exp(-j k \omega_m) \tag{11} $$


which  specifies  the  discrete  time  power  spectrum  of  the  voiced  speech 
process.  A  series  representation  for  the  original  random  process  can  be 
taken  to  be 

$$ S(n) = \sum_{k=0}^{N-1} X_k \exp(j n \omega_k) \tag{12} $$

where

$$ X_k = \frac{1}{N} \sum_{n=0}^{N-1} S(n) \exp(-j n \omega_k) \tag{13} $$

It  is  easy  to  show  that  the  correlation  between  the  voiced  speech  expansion 
coefficients  is  given  by 
Therefore, both voiced (periodic) and unvoiced (aperiodic) speech can be
expanded  in  terms  of  a  set  of  uncorrelated  random  variables  with  respect 
to  the  complex  exponential  basis.  For  unvoiced  speech  the  basis  was 
defined  on  an  interval  N  such  that  in  a  frequency  increment  1/N,  the 
unvoiced  power  spectral  density  was  slowly  changing.  For  voiced  speech 
the  basis  was  defined  on  an  interval  N  that  was  simply  one  period  of  the 
periodic  covariance  function.  The  spectrum  estimation  problem  for  the 
voiced  and  unvoiced  speech  cases  can  be  specified  in  terms  of  a  common 
parameter  estimation  problem  by  defining  the  "spectrum"  by 


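$$ p(X_1, X_2, \ldots, X_N \mid \theta)
   = \prod_{n=0}^{N-1} \frac{1}{\pi \lambda_n(\theta)} \exp\left[ -\frac{|X_n|^2}{\lambda_n(\theta)} \right]
   = \pi^{-N} \exp\left\{ -\sum_{n=0}^{N-1} \left[ \frac{|X_n|^2}{\lambda_n(\theta)} + \log \lambda_n(\theta) \right] \right\} \tag{16} $$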

The ML estimates of $\theta$ are found by maximizing this pdf. This is equivalent
to minimizing the negative of the logarithm of the pdf which is called
the likelihood function and is written as

$$ \ell(\theta) = -\log\left[ p(X_1, X_2, \ldots, X_N \mid \theta) \right]
   = N \log \pi + \sum_{n=0}^{N-1} \left[ \frac{|X_n|^2}{\lambda_n(\theta)} + \log \lambda_n(\theta) \right] \tag{17} $$


From this equation the ideas first proposed by Itakura and Saito [1] can be
used to develop a spectral matching interpretation of the ML criterion.

The first step is to obtain a lower bound on the likelihood function
by obtaining the ML estimates for the unconstrained problem.
For this case there are N unknown parameters $\lambda_n$ and the ML estimates are
easily shown to be $\hat{\lambda}_n = |X_n|^2$. The resulting minimum value of the
likelihood function is

$$ \ell_{\min} = N \log \pi + \sum_{n=0}^{N-1} \left( 1 + \log |X_n|^2 \right) \tag{18} $$
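The "easily shown" step can be spelled out. The following is only a sketch, writing $\ell$ for the likelihood function and $\lambda_n$ for the unconstrained eigenvalues (notation assumed from the surrounding equations): each term of (17) is minimized separately.

```latex
\frac{\partial \ell}{\partial \lambda_n}
  = -\frac{|X_n|^2}{\lambda_n^2} + \frac{1}{\lambda_n} = 0
  \;\Longrightarrow\;
  \hat{\lambda}_n = |X_n|^2 ,
\qquad
\left. \frac{|X_n|^2}{\lambda_n} + \log \lambda_n \right|_{\lambda_n = \hat{\lambda}_n}
  = 1 + \log |X_n|^2 .
```

Summing the right-hand expression over n and adding $N \log \pi$ gives the minimum value in (18).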


The likelihood function at any other value can then be written as

$$ \ell(\theta) - \ell_{\min} = \sum_{n=0}^{N-1} \left[ \frac{|X_n|^2}{\lambda_n(\theta)} - \log \frac{|X_n|^2}{\lambda_n(\theta)} - 1 \right] \tag{19} $$


The next step is to define the quantity

$$ E(f_n) = \log |X_n|^2 - \log \lambda_n(\theta) \tag{20} $$

which measures the dB error between the power at frequency $f_n$ measured
by $|X_n|^2$ and estimated by the model as $\lambda_n(\theta)$. Using this, the ML
criterion in (19) can be expressed by the nonlinear spectral matching
condition

$$ \ell(\theta) - \ell_{\min} = \sum_{n=0}^{N-1} \left\{ \exp[E(f_n)] - E(f_n) - 1 \right\} \tag{21} $$

Following [1] it is of interest to contrast this result with
that obtained with the squared dB error criterion given by

$$ f(\theta) = \sum_{n=0}^{N-1} \left[ E(f_n) \right]^2 \tag{22} $$


As  shown  in  Fig.  1,  this  condition  gives  equal  weight  to  dB  model  errors 
above  and  below  the  measured  power  samples,  whereas  the  ML  criterion 
gives  significantly  more  weight  to  errors  that  occur  when  the  model 
spectrum  lies  below  the  measured  data.  This  suggests  that  the  ML 
parameter  estimates  will  result  in  a  model  spectrum  that  "sits  on  top 
of"  the  spectral  measurements,  hence  it  would  appear  to  be  well  suited  for 
estimating  a  spectral  envelope  from  discrete  spectral  measurements. 
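The asymmetry described above can be checked numerically. The sketch below evaluates the per-sample terms of (21) and (22) for a few error values; the function names are hypothetical, and E is taken in natural-log units as in (20) rather than in literal decibels.

```python
import math


def ml_penalty(e):
    """Per-sample term of the ML criterion (21): exp(E) - E - 1."""
    return math.exp(e) - e - 1.0


def squared_penalty(e):
    """Per-sample term of the squared log-error criterion (22): E**2."""
    return e * e


if __name__ == "__main__":
    # E > 0 means the measured power lies above the model spectrum.
    for e in (-1.0, -0.5, 0.5, 1.0):
        print(f"E={e:+.1f}  ML={ml_penalty(e):.3f}  squared={squared_penalty(e):.3f}")
```

For errors of equal magnitude the ML term is larger when E is positive, that is, when the model lies below the measurement, while the squared term is the same for both signs.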

Although  these  properties  are  well-known  in  the  speech  community, 
they  were  thought  to  apply  only  to  the  case  of  aperiodic  speech  which 
could  be  modelled  in  terms  of  an  autoregressive  (all-pole)  process.  The 
significance  of  the  above  analysis  is  that  the  nonlinear  spectral 
matching  interpretation  applies  to  voiced  and  unvoiced  speech  and  does 
not  depend  on  a  specific  spectral  model  for  the  speech  generation  process. 
Hereafter,  we  shall  refer  to  (21)  as  the  maximum  likelihood  spectral 
matching  criterion. 

Another  advantage  of  the  maximum  likelihood  formulation  is  the 
insight  it  provides  into  the  way  in  which  the  spectral  measurements 
should  be  made.  For  unvoiced  speech  the  measurement  variables  are 

$$ X_k = \frac{1}{N} \sum_{n=0}^{N-1} S(n) \exp\left( -j \frac{2\pi n k}{N} \right) \tag{23} $$

which is simply the Discrete Fourier Transform (DFT) of the speech
process sampled at frequency k/N. The data satisfy the conditions of
the theory provided the frame length N is large enough that the actual
power spectral density changes slowly in a frequency increment 1/N.

[Fig. 1. Maximum likelihood spectral matching criterion. Panel title:
Nonlinear spectral matching criteria.]

Since  it  is  standard  practice  in  vocoder  design  to  choose  an  analysis 
window  of  at  least  20  ms  and  since  typical  unvoiced  speech  spectra 
change  very  little  in  50  Hz,  then  the  conditions  of  the  theory  are  met 
in  practice.  For  voiced  speech  the  measurements  are 

$$ X_k = \frac{1}{N} \sum_{n=0}^{N-1} S(n) \exp\left( -j \frac{2\pi n k}{N} \right) \tag{24} $$

where  N  is  the  period  of  the  voiced  speech  process.  This  result  suggests 
that  the  pitch  must  be  known  and  used  explicitly  in  developing  the 
correct  measurement  set  for  computing  the  ML  estimates,  a  procedure 
which  is  not  commonly  used  in  practice.  This  is  an  extremely  important 
point  because  it  suggests  that  if  the  data  truly  correspond  to  a  periodic 
process,  then  simply  assuming  that  the  speech  can  be  modelled  as  an  AR 
process  is  not  sufficient  for  optimally  extracting  the  measurement 
variables.  This  issue  will  be  reexamined  later  in  the  sequel  after  the 
solution  to  the  all-pole  modelling  problem  has  been  derived. 
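A minimal sketch of the pitch-synchronous measurement in (24) follows, assuming the pitch period is known exactly and is an integer number of samples. The function name and the test signal are hypothetical, and the direct summation is written for clarity rather than speed.

```python
import cmath
import math


def harmonic_powers(s, period):
    """Return |X_k|^2 for k = 0..N-1 from one pitch period of s, as in (24)."""
    n_pts = period
    powers = []
    for k in range(n_pts):
        x_k = sum(s[n] * cmath.exp(-2j * math.pi * n * k / n_pts)
                  for n in range(n_pts)) / n_pts
        powers.append(abs(x_k) ** 2)
    return powers


# Hypothetical test signal: a pure third harmonic of a 40-sample pitch period.
signal = [math.cos(2 * math.pi * 3 * n / 40) for n in range(200)]
powers = harmonic_powers(signal, 40)
```

Because the analysis length equals the period, all of the power falls in the bins k = 3 and k = N - 3; an analysis length that is not tied to the pitch would spread it over neighbouring bins.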

III.  ALL-POLE  MODELLING 


Although  the  ML  criterion  that  was  derived  in  the  previous  section 
can  be  used  to  estimate  the  parameters  of  model  spectra  of  arbitrary 
functional  form,  there  is  particular  interest  in  the  all-pole  model 
since  it  corresponds  to  autoregressive  (AR)  processes  that  arise  in 
diverse  applications  of  spectral  analysis,  and,  in  speech  in  particular, 
it is related to the Linear Prediction method of speech coding. Recently
Kay [2] has developed a recursive solution to the exact ML problem for
aperiodic AR processes using a covariance domain solution technique that
is an extension of the covariance method of Linear Prediction. In this
section Kay's analysis will be generalized in terms of the spectral
domain formulation of the ML problem that was developed in the previous
section, and will thereby yield the exact ML spectrum for aperiodic and
periodic processes.

The  analysis  begins  with  a  restatement  of  the  ML  estimation  problem 
for  processes  for  which  the  spectral  eigenvalues  have  the  all-pole  form: 

$$ \lambda_n(\theta) = \frac{\sigma_M^2}{|A_M(\omega_n)|^2} \tag{25} $$

where, as before, $\omega_n = 2\pi n/N$ and where M is the model order and

$$ A_M(\omega) = 1 - \sum_{m=1}^{M} a_m^{(M)} \exp(-j m \omega) \tag{26} $$

The parameters to be estimated are

$$ \theta = \left[ \sigma_M^2, \; a_1^{(M)}, \; a_2^{(M)}, \; \ldots, \; a_M^{(M)} \right] \tag{27} $$

The ML estimate of $\theta$ is obtained by minimizing the ML spectral matching
criterion in (19) or equivalently the likelihood function in (17) which
is

$$ \ell(\theta) = \sum_{n=0}^{N-1} \frac{|X_n|^2}{\lambda_n(\theta)} + \sum_{n=0}^{N-1} \log \lambda_n(\theta) + N \log \pi \tag{28} $$

McAulay  [10]  has  shown  that  the  key  to  deriving  the  exact  ML  solution 
depends  on  the  derivation  of  an  alternate  expression  for  the  second  term 
in  (28).  As  a  first  step  it  is  obvious  that 


$$ \sum_{n=0}^{N-1} \log \lambda_n = \log \det(\Lambda) \tag{29} $$

where $\Lambda$ is the NxN diagonal matrix

$$ \Lambda = \mathrm{diag}(\lambda_0, \lambda_1, \ldots, \lambda_{N-1}) \tag{30} $$

Hence,  the  problem  reduces  to  finding  an  equivalent  expression  for  this 
matrix.  This  can  be  done  by  noting  that  the  underlying  model  covariance 
matrix  satisfies  the  relations 

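$$ R(t,s) = E\left[ \sum_{i=0}^{N-1} X_i \phi_i(t) \sum_{j=0}^{N-1} X_j^* \phi_j^*(s) \right]
   = \sum_{n=0}^{N-1} \lambda_n \phi_n(t) \phi_n^*(s) \tag{31} $$

This is known as Mercer's Theorem and follows from the fact that
$E(X_i X_j^*) = \lambda_i \delta_{ij}$, a condition which was derived in Section II.
Equation (31) can be written in matrix notation as

$$ R = \Phi^T \Lambda \Phi^* \tag{32} $$

by defining the NxN covariance matrix $(R)_{ij} = R(i,j)$ and the NxN
transformation matrix $\Phi$ as

$$ \Phi = \begin{bmatrix}
   \phi_0(0) & \phi_0(1) & \cdots & \phi_0(N-1) \\
   \phi_1(0) & \phi_1(1) & \cdots & \phi_1(N-1) \\
   \vdots & \vdots & & \vdots \\
   \phi_{N-1}(0) & \phi_{N-1}(1) & \cdots & \phi_{N-1}(N-1)
   \end{bmatrix} \tag{33} $$

where $\Phi^T$ and $\Phi^*$ are the transpose and the conjugate of the matrix
$\Phi$ respectively. As a consequence of (32)

$$ \det(R) = \det(\Phi^T \Phi^*) \cdot \det(\Lambda) \tag{34} $$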

However, the elements of $\Phi$ are the expansion eigenfunctions which in
this case are the complex exponentials, viz. $\phi_n(k) = \exp(-j 2\pi n k/N)$, and $\Phi$ is
therefore related to the Discrete Fourier Transform (DFT). In fact
the DFT of a vector z is

$$ Z \triangleq \mathrm{DFT}(z) = \Phi z \tag{35} $$

hence, the Inverse DFT is

$$ z = \frac{1}{N} \Phi^* Z \tag{36} $$

which implies that $\frac{1}{N} \Phi^* \Phi = I$, the identity matrix, from which it
follows that $\det(\Phi^T \Phi^*) = N$. Using this result in (34) leads to the
equation

$$ \det(\Lambda) = \frac{1}{N} \det(R) \tag{37} $$

hence  the  exact  solution  to  the  ML  problem  depends  on  the  evaluation 
of  det  (R) .  This  latter  computation  has  been  done  by  Kay  [2]  in  a 
covariance  domain  derivation  of  the  exact  solution  to  the  ML  problem. 
He  showed  that 

$$ \det(R) = \frac{\sigma_M^{2N}}{\prod_{m=1}^{M} (1 - K_m^2)^m} \tag{38} $$

where $K_1, K_2, \ldots, K_M$ are the so-called reflection coefficients and are
related to the original all-pole spectral model through the Levinson-
Durbin algorithm [6] which requires that

$$ a_i^{(m)} = \begin{cases}
   a_i^{(m-1)} - K_m \, a_{m-i}^{(m-1)}, & i = 1, 2, \ldots, m-1 \\
   K_m, & i = m
   \end{cases} \tag{39} $$

As a consequence of (29), (37) and (38) it follows that

$$ \sum_{n=0}^{N-1} \log \lambda_n = N \log \sigma_M^2 - \sum_{m=1}^{M} m \log(1 - K_m^2) - \log N \tag{40} $$
which can be substituted into (28) to result in the following equation
for the exact likelihood function:

$$ \ell(\theta) = \frac{1}{\sigma_M^2} \sum_{n=0}^{N-1} |X_n|^2 |A_M(\omega_n)|^2 + N \log \sigma_M^2 - \sum_{m=1}^{M} m \log(1 - K_m^2) + \log\left( \frac{\pi^N}{N} \right) \tag{41} $$

Since equation (39) can be used with (26) to derive the well-known
recursions:

$$ A_m(\omega) = A_{m-1}(\omega) + K_m B_{m-1}(\omega) \tag{42a} $$

$$ B_m(\omega) = \exp(-j\omega) \left[ B_{m-1}(\omega) + K_m A_{m-1}(\omega) \right] \tag{42b} $$

which are initialized at stage m = 0 with the conditions

$$ A_0(\omega) = 1 \tag{43a} $$

$$ B_0(\omega) = -\exp(-j\omega) \tag{43b} $$


it follows that the likelihood function now depends only on the parameters
$\sigma_M^2, K_1, K_2, \ldots, K_M$. Equations (41)-(43) represent the spectral domain
equivalent of Kay's covariance based expression for the exact likelihood
function. However, by suitably interpreting the meaning of the power
measurements, $|X_n|^2$, the above results apply to the more general case
that includes aperiodic and periodic processes.
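The recursions (42)-(43) can be sketched in a few lines. The function name is hypothetical and the frequencies are assumed to be given in radians per sample; nothing here is specific to harmonic spacing.

```python
import cmath
import math


def lattice_polynomials(reflection, omegas):
    """Evaluate A_M and B_M at each frequency via (42a)-(42b), starting from (43)."""
    A = [1.0 + 0.0j for _ in omegas]             # (43a)
    B = [-cmath.exp(-1j * w) for w in omegas]    # (43b)
    for k_m in reflection:
        # Both updates use the stage m-1 values on the right-hand side.
        A, B = ([a + k_m * b for a, b in zip(A, B)],
                [cmath.exp(-1j * w) * (b + k_m * a)
                 for a, b, w in zip(A, B, omegas)])
    return A, B


# Example: two stages evaluated on an 8-point frequency grid.
omegas = [2 * math.pi * n / 8 for n in range(8)]
A2, B2 = lattice_polynomials([0.5, -0.3], omegas)
```

With real reflection coefficients the two outputs have equal magnitude at every frequency, which is the property that later allows the gain and the norm of B to share one recursion.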

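The gain optimized likelihood function is obtained by choosing $\sigma_M^2$
such that $\partial \ell / \partial \sigma_M^2 = 0$. This results in the ML estimate for the gain
which is

$$ \hat{\sigma}_M^2 = \frac{1}{N} \sum_{n=0}^{N-1} |X_n|^2 |A_M(\omega_n)|^2 \tag{44} $$

and the gain optimized likelihood function becomes

$$ \ell(K_1, K_2, \ldots, K_M) = N \log \hat{\sigma}_M^2 - \sum_{m=1}^{M} m \log(1 - K_m^2) + \log\left[ \frac{(\pi e)^N}{N} \right] \tag{45} $$

In order to derive a simple recursion for $\ell(K)$ the vectors

$$ \mathbf{A}_m = \left( A_m(\omega_0), A_m(\omega_1), \ldots, A_m(\omega_{N-1}) \right) \tag{46a} $$

$$ \mathbf{B}_m = \left( B_m(\omega_0), B_m(\omega_1), \ldots, B_m(\omega_{N-1}) \right) \tag{46b} $$

are used to define an inner product

$$ \langle \mathbf{A}_m, \mathbf{B}_m \rangle = \frac{1}{N} \sum_{n=0}^{N-1} |X_n|^2 A_m(\omega_n) B_m^*(\omega_n) \tag{47} $$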


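By defining the variables

$$ \alpha_M = \mathrm{Re} \langle \mathbf{A}_M, \mathbf{B}_M \rangle \tag{48a} $$

$$ \beta_M^2 = \| \mathbf{B}_M \|^2 \tag{48b} $$

and using the recursion in (42a) for $A_M(\omega)$ it follows that

$$ \hat{\sigma}_M^2 = \hat{\sigma}_{M-1}^2 + 2 K_M \alpha_{M-1} + K_M^2 \beta_{M-1}^2 \tag{49} $$

However, the recursion in (42b) results in the fact that

$$ \beta_M^2 = \beta_{M-1}^2 + 2 K_M \alpha_{M-1} + K_M^2 \hat{\sigma}_{M-1}^2 \tag{50} $$

and the initial conditions in (43) require that

$$ \hat{\sigma}_0^2 = \frac{1}{N} \sum_{n=0}^{N-1} |X_n|^2 = \beta_0^2 \tag{51} $$

As a consequence $\hat{\sigma}_M^2$ and $\beta_M^2$ obey the same recursion, and therefore

$$ \hat{\sigma}_M^2 = \beta_M^2 \tag{52} $$

for all M. As a result the recursion for $\hat{\sigma}_M^2$ can be written as

$$ \hat{\sigma}_M^2 = \hat{\sigma}_{M-1}^2 \left( 1 + 2 \rho_{M-1} K_M + K_M^2 \right) \tag{53} $$

where

$$ \rho_{M-1} = \frac{\alpha_{M-1}}{\hat{\sigma}_{M-1}^2} \tag{54} $$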


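Letting $\ell_M = \ell(K_1, K_2, \ldots, K_M)$, and substituting (53) in (45) leads to the
expression

$$ \ell_M = N \log \hat{\sigma}_{M-1}^2 + N \log\left( 1 + 2 \rho_{M-1} K_M + K_M^2 \right) - \sum_{m=1}^{M} m \log(1 - K_m^2) + \log\left[ \frac{(\pi e)^N}{N} \right] $$

$$ \phantom{\ell_M} = \ell_{M-1} + N \log\left( 1 + 2 \rho_{M-1} K_M + K_M^2 \right) - M \log(1 - K_M^2) \tag{55} $$

which with the initial condition

$$ \ell_0 = N \log \hat{\sigma}_0^2 + \log\left[ \frac{(\pi e)^N}{N} \right] \tag{56} $$

is the desired recursion. This is a somewhat simpler expression than
that obtained by Kay due to the fact that the frequency domain formulation
led to the condition $\hat{\sigma}_M^2 = \beta_M^2$.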

When  Itakura  and  Saito  studied  the  problem  of  ML  estimation  of  the 
spectral  parameters  of  all-pole  aperiodic  processes,  they  invoked  the 
condition  that  N,  the  number  of  independent  spectral  measurements, 
greatly  exceeded  the  model  order,  M,  (i.e.,  N>>M).  Under  this  condition 
the  likelihood  function  in  (55)  reduces  to 


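$$ \ell_M = \ell_{M-1} + N \log\left( 1 + 2 \rho_{M-1} K_M + K_M^2 \right) \tag{57} $$

If the optimum values for $K_1, K_2, \ldots, K_{M-1}$ have been found, then $\ell_{M-1}$
takes its smallest value and $\ell_M$ is minimized simply by choosing $K_M$ to
satisfy $\partial \ell_M / \partial K_M = 0$. This results in the estimator

$$ \hat{K}_M = -\rho_{M-1} \tag{58} $$

The recursion for the spectral gain, equation (53), then simplifies to

$$ \hat{\sigma}_M^2 = \hat{\sigma}_{M-1}^2 \left( 1 - \hat{K}_M^2 \right) \tag{59} $$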
Furthermore, application of the Schwarz inequality to (48a) is sufficient
to show that the ML reflection coefficients will satisfy the condition
$|K_M| < 1$. Therefore, the results for the condition N>>M are consistent
with those that have been obtained by Markel and Gray [5] using the
orthogonal polynomial approach to LPC analysis.

For the more general case for which the values of N and M are
arbitrary, the ML estimate for $K_M$ must also satisfy the condition $\partial \ell_M / \partial K_M = 0$
which leads to the cubic equation

$$ (N-M) K_M^3 + (N-2M) \rho_{M-1} K_M^2 - (N+M) K_M - N \rho_{M-1} = 0 \tag{60} $$
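The step from the likelihood recursion (55) to the cubic can be written out as follows. This is a sketch of the intermediate algebra only, writing $\ell_M$ for the gain optimized likelihood, $K_M$ for the reflection coefficient, and $\rho_{M-1}$ for the normalized correlation, as those quantities are used in (53)-(55).

```latex
\frac{\partial \ell_M}{\partial K_M}
  = \frac{2N(\rho_{M-1} + K_M)}{1 + 2\rho_{M-1}K_M + K_M^2}
  + \frac{2M K_M}{1 - K_M^2} = 0
\\[4pt]
\Longrightarrow\;
  N(\rho_{M-1} + K_M)(1 - K_M^2)
  + M K_M \left(1 + 2\rho_{M-1}K_M + K_M^2\right) = 0
\\[4pt]
\Longrightarrow\;
  (N-M)K_M^3 + (N-2M)\rho_{M-1}K_M^2 - (N+M)K_M - N\rho_{M-1} = 0 .
```

Setting M = 0 in the second term recovers the approximate estimator $K_M = -\rho_{M-1}$, so the extra term $M K_M(\cdot)$ is what separates the exact solution from the N>>M solution.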


Although there are three roots to this equation, inspection of (55)
shows that there will be only one root that satisfies $|K_M| < 1$. Since
closed  form  expressions  are  available  for  the  roots  of  equation  (60), 
then,  if  computations  are  being  done  using  floating  point  arithmetic,  it 
is  straightforward  to  determine  the  ML  estimator  numerically.  In  speech 
applications  which  require  real-time  processing  using  fixed  point  arithmetic 
it  has  been  found  that  Newton-Raphson  methods  are  easier  to  implement. 
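As an illustration of selecting the admissible root of (60), the sketch below uses plain bisection on the interval (-1, 1) rather than the closed-form or Newton-Raphson approaches mentioned above; this is a swapped-in technique chosen only because the cubic changes sign exactly once on that interval when $|\rho| < 1$ and N > M. The function name and argument names are hypothetical.

```python
def exact_reflection(rho, n_meas, order):
    """Root of the cubic (60) inside (-1, 1), located by bisection.

    rho    : normalized correlation rho_{M-1}, assumed to satisfy |rho| < 1
    n_meas : number of power measurements N
    order  : current stage M, assumed to satisfy M < N
    """
    def cubic(k):
        return ((n_meas - order) * k ** 3
                + (n_meas - 2 * order) * rho * k ** 2
                - (n_meas + order) * k
                - n_meas * rho)

    # cubic(-1) = 2M(1 - rho) > 0 and cubic(+1) = -2M(1 + rho) < 0.
    lo, hi = -1.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if cubic(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The root always lies between 0 and the approximate estimate $-\rho$, and it approaches $-\rho$ as N grows relative to M.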

In  the  next  section  some  examples  will  be  given  that  show  the  difference 
in  the  spectral  estimates  obtained  using  the  exact  and  approximate  ML 
estimators. 

IV.  APPLICATION  TO  SPEECH  CODING 

In contrast to the standard approaches to all-pole spectral analysis
that are based on estimates of the ensemble covariance function
computed from time-domain data, the ML procedure estimates the power
spectral density through the frequency domain power measurement $|X_n|^2$
where

$$ X_n = \frac{1}{N} \sum_{k=0}^{N-1} S(k) \exp(-j 2\pi n k/N) \tag{61} $$

where N represents the frame length for aperiodic processes and the
period  for  periodic  processes.  In  order  to  make  the  requisite  power 
measurements  in  the  latter  case,  the  pitch  period  must  be  known  with 
precision,  which  is  difficult  to  achieve  in  practice.  In  addition, 
evaluation  of  the  DFT  in  (61)  represents  a  computationally  intensive 
task.  A  more  efficient  approach  would  be  to  use  the  Fast  Fourier  Transform 
(FFT)  to  generate  a  high  resolution  spectrum  which  could  be  sampled  at 
the  pitch  harmonics  to  determine  the  desired  power  measurements.  The 
problem  is,  however,  that  not  only  is  it  difficult  to  know  the  true 
pitch  with  precision,  but  in  many  cases  the  voiced  speech  spectrum  is 
not  always  harmonic,  especially  above  1000  Hz.  Hence,  sampling  the  high 
resolution  spectrum  could  not  be  expected  to  produce  reliable  estimates 
of  the  power  in  the  pitch  harmonics.  In  the  development  of  the  SEEVOC 
voice  coding  algorithm  Paul  [7]  has  developed  a  method  for  avoiding 
these  problems  using  a  heuristic  that  selects  the  largest  spectral  peak 
within  successive  frequency  windows  equal  in  width  to  the  average  pitch 
frequency. For example, if the average pitch is $\omega_0$, then the first step
is to find the largest peak in the frequency range $(\omega_0/2, \, 3\omega_0/2)$.
Suppose this peak, which yields the power $|X_1|^2$, occurs at frequency $\omega_1$,
then the next peak, $|X_2|^2$, is found in the region $(\omega_1 + \omega_0/2, \, \omega_1 + 3\omega_0/2)$,
and so on. If no peak exists within a particular region, then the
frequency at which the largest end point occurs is chosen as the next
harmonic. The algorithm has been studied extensively in simulations and
in real-time implementations and has been found to produce satisfactory
measurements for the harmonic powers $|X_n|^2$.
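The peak-picking heuristic described above can be sketched as follows. This is not Paul's implementation; it is a minimal reading of the description, assuming the high resolution spectrum is supplied as a list of power samples and the average pitch is expressed in spectral bins. The function and variable names are hypothetical.

```python
import math


def pick_harmonic_peaks(power, w0):
    """Return the bin index chosen in each successive pitch-wide window.

    power : samples of a high resolution power spectrum
    w0    : average pitch frequency, in bins (assumed positive)
    """
    last = len(power) - 1
    peaks = []
    lo, hi = w0 / 2.0, 3.0 * w0 / 2.0
    while lo < last:
        i0, i1 = math.ceil(lo), min(math.floor(hi), last)
        if i0 > i1:
            break
        # Strict local maxima inside the current window.
        maxima = [i for i in range(max(i0, 1), min(i1, last - 1) + 1)
                  if power[i] > power[i - 1] and power[i] > power[i + 1]]
        if maxima:
            k = max(maxima, key=lambda i: power[i])
        else:
            # No peak in the window: take the larger of the two end points.
            k = i0 if power[i0] >= power[i1] else i1
        peaks.append(k)
        # The next window is positioned relative to the peak just found.
        lo, hi = k + w0 / 2.0, k + 3.0 * w0 / 2.0
    return peaks


# Hypothetical spectrum with slightly inharmonic peaks near multiples of 10 bins.
spectrum = [sum(math.exp(-(i - c) ** 2 / 4.5) for c in (10, 21, 29, 40))
            for i in range(46)]
found = pick_harmonic_peaks(spectrum, 10)
```

Because each window is placed relative to the previously selected peak, the search follows peaks that drift away from exact multiples of the average pitch.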

In  order  to  reduce  the  computational  complexity  the  same  algorithm 
is  used  to  determine  the  power  measurements  during  unvoiced  speech. 

Since the high resolution spectrum is more or less continuous (i.e., not
harmonic), the peak picking algorithm essentially samples the measured
spectrum at the average pitch frequency $\omega_0$. Therefore, in summary, the
power measurements $|X_n|^2$ and the frequencies $\omega_n$ that are actually used
in  the  ML  estimation  procedure  are  taken  to  be  those  obtained  from  the 
pitch-directed  peak-picking  routine.  Although  these  measurements  are 
not  harmonically  related  in  general,  the  properties  of  the  nonlinear  ML 
spectral  matching  criterion  described  in  Section  II,  still  apply  and 
determine  the  nature  of  the  all-pole  model  fit  to  the  power  measurements. 

The  next  step  in  determining  the  spectral  parameters  is  to  set  up 
initial  conditions  from  (43),  (44),  and  (48a).  These  are 


$$ A_0(\omega_n) = 1 \tag{62a} $$

$$ B_0(\omega_n) = -\exp(-j\omega_n) \tag{62b} $$

$$ \hat{\sigma}_0^2 = \frac{1}{N} \sum_{n=0}^{N-1} |X_n|^2 \tag{62c} $$

$$ \alpha_0 = -\frac{1}{N} \sum_{n=0}^{N-1} |X_n|^2 \cos \omega_n \tag{62d} $$

The first reflection coefficient is found by solving (60), which, for
this case is,

$$ (N-1) K_1^3 + (N-2) \rho_0 K_1^2 - (N+1) K_1 - N \rho_0 = 0 \tag{63} $$

where $\rho_0 = \alpha_0 / \hat{\sigma}_0^2$. The approximate ML estimate of $K_1$ is $\hat{K}_1 = -\rho_0$, which
is used to initialize a Newton-Raphson search for the solution of (63).
Usually convergence to 15 bit accuracy is obtained in less than five
iterations. Having obtained $K_1$, the next step is to update the above
quantities according to (42), (48a) and (53). The appropriate
computations are


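$$ A_M(\omega_n) = A_{M-1}(\omega_n) + K_M B_{M-1}(\omega_n) \tag{64a} $$

$$ B_M(\omega_n) = \exp(-j\omega_n) \left[ B_{M-1}(\omega_n) + K_M A_{M-1}(\omega_n) \right] \tag{64b} $$

$$ \alpha_M = \frac{1}{N} \sum_{n=0}^{N-1} |X_n|^2 \, \mathrm{Re}\left[ A_M(\omega_n) B_M^*(\omega_n) \right] \tag{64c} $$

$$ \hat{\sigma}_M^2 = \hat{\sigma}_{M-1}^2 \left( 1 + 2 \rho_{M-1} K_M + K_M^2 \right) \tag{64d} $$

which are used to determine $\rho_M = \alpha_M / \hat{\sigma}_M^2$ and then $\hat{K}_{M+1} = -\rho_M$ is used to
initiate the Newton-Raphson search for the solution to

$$ (N-M-1) K_{M+1}^3 + (N-2M-2) \rho_M K_{M+1}^2 - (N+M+1) K_{M+1} - N \rho_M = 0 \tag{65} $$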
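The complete recursion, from the initial conditions (62) through the updates (64) and the cubic (65), can be sketched as below. This is a floating point illustration under stated assumptions, not the fixed point real-time implementation: the function name, the `exact` switch, and the fixed count of Newton-Raphson steps are all hypothetical choices, and N is assumed to exceed the model order.

```python
import cmath
import math


def ml_allpole(power, omegas, order, exact=True, newton_steps=10):
    """Estimate reflection coefficients and gain from power measurements.

    power  : measured powers |X_n|^2
    omegas : matching frequencies omega_n in radians per sample
    """
    n_meas = len(power)
    A = [1.0 + 0.0j] * n_meas                       # (62a)
    B = [-cmath.exp(-1j * w) for w in omegas]       # (62b)
    sigma2 = sum(power) / n_meas                    # (62c)
    reflection = []
    for m in range(1, order + 1):
        # alpha_{m-1} from (62d) / (64c), then rho_{m-1} = alpha / sigma^2.
        alpha = sum(p * (a * b.conjugate()).real
                    for p, a, b in zip(power, A, B)) / n_meas
        rho = alpha / sigma2
        k = -rho                                    # approximate estimate (58)
        if exact:
            # Newton-Raphson on the cubic (60), started at the approximate value.
            c3, c2 = n_meas - m, (n_meas - 2 * m) * rho
            c1, c0 = -(n_meas + m), -n_meas * rho
            for _ in range(newton_steps):
                f = ((c3 * k + c2) * k + c1) * k + c0
                df = (3.0 * c3 * k + 2.0 * c2) * k + c1
                k -= f / df
        A, B = ([a + k * b for a, b in zip(A, B)],              # (64a)
                [cmath.exp(-1j * w) * (b + k * a)               # (64b)
                 for a, b, w in zip(A, B, omegas)])
        sigma2 *= 1.0 + 2.0 * rho * k + k * k                   # (64d)
        reflection.append(k)
    return reflection, sigma2


# Hypothetical check: powers sampled from a first-order all-pole spectrum.
grid = [2 * math.pi * n / 64 for n in range(64)]
meas = [1.0 / abs(1 - 0.5 * cmath.exp(-1j * w)) ** 2 for w in grid]
k_approx, g_approx = ml_allpole(meas, grid, 1, exact=False)
k_exact, g_exact = ml_allpole(meas, grid, 1, exact=True)
```

On this synthetic input the approximate estimate recovers the generating coefficient, while the exact estimate is slightly smaller in magnitude, in line with the smoothing effect attributed to the exact solution.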

This algorithm has been found to be amenable to fixed point arithmetic
and has been found to produce numerically stable estimates of the power
spectral density. Furthermore, at the end of the recursive process, the
data are immediately available for computing the spectral envelope, which
is extremely useful in certain low rate speech coding applications [3].
Typical results that are obtained using this procedure are shown in Fig. 2
for a voiced speech segment for a low pitched male speaker. The ML
power measurements obtained from the peak-picking heuristic are indicated
by the crosses at the top of the vertical bars. The spectral envelopes
obtained by the exact and the approximate ML procedures are shown for
comparison. Figure 3 illustrates typical results for a high-pitched
female. While this represents a worst case condition with respect to
the validity of the approximate solution, the spectral fit appears to be
quite good. An example of the spectral envelopes obtained for an unvoiced
speech segment is illustrated in Fig. 4.

A  considerable  amount  of  speech  data  has  been  examined  graphically 

and  it  has  been  observed  that  for  voiced  speech  the  exact  algorithm 

generally  leads  to  wider  formant  bandwidths  and  reduced  formant  amplitudes. 

From  (55)  and  (57)  it  is  noted  that  the  difference  between  the  exact  and 

2 

approximate  ML  criteria  is  the  term  -M  log  (1-K^j)  which  has  the  effect  of  forcing 
the  reflection  coefficient  towards  zero,  and  causes  the  formant  broadening. 

Fig. 2. Voiced speech spectral envelopes (male speaker, pitch = 119 Hz). Axes: power (dB) versus frequency (kHz).
Fig. 3. Voiced speech spectral envelopes (female speaker, pitch = 250 Hz).

For unvoiced speech the approximate and the exact ML analyses result in
spectral fits that appear to be essentially the same. It is interesting
to note that even in the case of the low-pitched speaker, the curves in
Fig. 2 show that the N>>M approximation is not valid. This suggests
that for voiced speech it may be incorrect to interpret LPC spectral
fits in terms of the Itakura-Saito spectral matching criterion. Rather,
it would appear to be more accurate to explain the LPC spectral
fitting process in terms of spectral flattening, as was done by Makhoul
[8] and Markel and Gray [5].
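
One way to see what the N>>M approximation does is to take the large-N
limit of the cubic that defines the exact reflection coefficient. The
worked step below assumes that the cubic has the form shown on its first
line, with $\rho_M$ denoting the ratio used to start the Newton-Raphson
search. Both the form and the notation are assumptions made for
illustration.

```latex
\begin{align*}
(N-M-1)K^3 + (N-2M-2)\rho_M K^2 - (N+M+1)K - N\rho_M &= 0 \\
\text{divide by } N \text{ and let } M/N \to 0:\quad
K^3 + \rho_M K^2 - K - \rho_M &= 0 \\
(K + \rho_M)(K^2 - 1) &= 0
\end{align*}
```

The only root with |K| < 1 is then K = -rho_M, the value that serves as
the starting point of the search. When M is not small compared with N
the neglected terms matter: a single Newton-Raphson step from K = -rho_M
on the first line already gives K = -rho_M N/(N+M+1), which for N = 12
and M = 9 is only about 0.55 of the starting value.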

In  order  to  determine  whether  or  not  these  numerical  differences 
were perceptually detectable, a real-time speech analysis/synthesis
vocoding system was developed using a tenth-order spectral model over a
4  kHz  audio  bandwidth.  The  estimated  spectral  parameters  were  used  in 
conjunction  with  a  standard  acoustic  tube  synthesizer  and  a  high  quality 
spectrally  flattened  filterbank  synthesizer  [9].  It  was  judged  that  the 
exact  ML  procedure  resulted  in  synthetic  speech  that  had  the  quality  of 
having  been  too  heavily  smoothed.  For  the  acoustic  tube  synthesizer, 
minor  spectral  distortions  were  produced  occasionally  when  the  approximate 
analysis  algorithm  was  used.  Although  these  distortions  were  not  produced 
by  the  exact  algorithm,  the  loss  in  speech  naturalness  caused  by  the 
excessive  smoothing  resulted  in  a  preference  for  the  approximate  ML 
analysis  system.  These  effects  were  less  pronounced  for  female  speakers. 
In  all  cases  the  differences  were  less  noticeable  when  the  filterbank 
synthesizer  was  used. 

In  another  real-time  evaluation  the  approximate  pitch-directed 
frequency  domain  ML  vocoder  was  compared  with  a  standard  autocorrelation 
LPC  system.  The  quality  and  intelligibility  of  these  systems  were 
judged  to  be  essentially  equivalent.  Therefore,  from  a  perceptual  point 
of view, there appears to be no advantage in using the frequency-based
spectral estimator that makes explicit use of the voiced and unvoiced
speech models. It turns out, however, that the frequency domain implementation
is  particularly  well  suited  to  the  development  of  new  vocoding  algorithms. 

V.  SUMMARY  AND  CONCLUSIONS 

The  overall  objective  of  this  report  was  to  try  to  determine  whether 
or  not  better  spectral  estimators  could  be  developed  for  speech  by 
taking  into  account  the  periodic  structure  of  the  voiced  speech  sounds. 

As  a  starting  point  it  was  assumed  that  both  voiced  and  unvoiced  speech 
could be modelled as random processes. In the latter case the model of a
linear filter driven by white noise is not only mathematically convenient,
but also corresponds to the speech production mechanism. Voiced
speech was modelled as a periodic random process, that is, a random process
having a periodic ensemble covariance function. Although this model has
no  relevance  to  the  physiological  mechanism  by  which  voiced  sounds  are 
produced,  mathematically  it  can  be  used  to  generate  a  class  of  spectra 
that  have  roughly  the  same  properties  as  voiced  speech  spectra,  and 
hence  it  was  adopted  as  being  suitable  for  the  type  of  analysis  to  be 
undertaken.  From  this  common  mathematical  framework,  the  random  process 
could  be  expanded  in  terms  of  complex  exponential  basis  functions,  such 
that  the  statistical  information  was  embedded  in  a  set  of  expansion 
coefficients  that  were  uncorrelated  random  variables.  Assuming  that  the 
Gaussian  model  could  be  applied  to  the  speech  processes,  the  probability 
density  function  could  be  computed.  The  parameters  of  the  underlying 
spectral  model  were  embedded  in  a  set  of  eigenvalues  on  which  the  pdf 
depended  explicitly.  Statistical  estimates  of  these  parameters  were 
obtained  using  the  maximum  likelihood  method  and  it  was  found  that  the 
nature  of  the  resulting  spectral  fit  was  a  highly  nonlinear  function  of 
the  logarithmic  error  between  the  measured  and  the  model  spectra.  In 
fact this maximum likelihood spectral matching criterion was the same as
that  derived  by  Itakura  and  Saito  [1]  for  the  special  case  of  unvoiced 
(aperiodic)  sounds  for  which  the  linear  filter  was  all-pole.  The  first 
important  result  of  the  analysis  was  to  show  that  this  spectral  matching 
criterion  was  a  general  property  of  ML  estimation  and  applied  to  aperiodic 
and  to  periodic  processes  and  did  not  depend  on  a  specific  parametric 
spectral  model.  In  particular,  it  did  not  need  to  be  all-pole.  Furthermore, 
although  the  Gaussian  assumption  was  needed  to  derive  the  ML  spectral 
matching  criterion,  it  was  no  longer  essential  to  the  analysis,  since 
any  subsequent  evaluation  of  the  ML  spectral  estimates  could  be  judged 
on  the  basis  of  the  "goodness"  of  the  spectral  match. 

The  analysis  was  then  specialized  to  the  all-pole  spectral  envelope 
and solved exactly, drawing heavily upon some recent work by Kay [2].
Heretofore, solutions to the ML estimation problem were approximate,
requiring that the condition N>>M be satisfied, where M was the all-pole
model order and where, for unvoiced speech, N was the frame length. For
voiced speech N was the pitch period or, equivalently, the number of
harmonics in the audio bandwidth. Since a pitch of 300 Hz is not unusual,
a 3600 Hz bandwidth would contain only 12 harmonics. Hence, for a
10th-order model it appeared that the results of the approximate ML
analysis would not be valid for high-pitched speakers, and that this was
perhaps the reason for the poor performance of all-pole vocoders in this
particular case.
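
The harmonic-count arithmetic above can be checked with a few lines of
code. The sketch below is only a worked example: the function name is
hypothetical, the 300 Hz and 3600 Hz values are those quoted in the text,
and the other two cases pair the pitches quoted in the captions of Figs.
2 and 3 with the 4 kHz bandwidth of the real-time vocoder.

```python
def harmonic_count(pitch_hz, bandwidth_hz):
    """Number of pitch harmonics at or below bandwidth_hz."""
    return int(bandwidth_hz // pitch_hz)


MODEL_ORDER = 10  # M, the all-pole model order

if __name__ == "__main__":
    for pitch_hz, bandwidth_hz in [(119, 4000), (250, 4000), (300, 3600)]:
        n = harmonic_count(pitch_hz, bandwidth_hz)
        print(f"pitch {pitch_hz} Hz, bandwidth {bandwidth_hz} Hz: "
              f"N = {n}, N/M = {n / MODEL_ORDER:.1f}")
```

The three cases give N = 33, 16 and 12, so the ratio N/M ranges from
about 3 down to about 1, which is far from the condition N>>M.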

In  order  to  examine  this  conjecture  in  detail  an  algorithm  was 
developed  for  determining  the  ML  power  measurements  for  speech  using  the 
pitch-directed peak-picking heuristic developed by D. Paul [7]. Although
only minor differences were observed when the approximate and the exact
ML procedures were applied to unvoiced speech, it was consistently noted
that for voiced speech the exact analysis led to spectral estimates that
widened the formant bandwidths while reducing the formant amplitudes.
Evaluation using a real-time vocoder implementation revealed that the
effect of the exact analysis was to produce synthetic speech that had
the quality of having been smoothed excessively. The synthetic speech
produced by the approximate analysis was judged to be more natural, and
the approximate system was preferred, which shows that a strict
implementation of the Itakura-Saito criterion is not perceptually
desirable. Rather, the spectral flattening criterion that is implicit in
LPC seems to be preferable. The approximate ML system was then compared
with a standard autocorrelation-based LPC vocoder, and the synthetic
speech for both systems was judged to be essentially equivalent in
quality and intelligibility.


Therefore, although a generalized and unifying theory for spectral
analysis of aperiodic and periodic processes has successfully been
developed, its application to narrowband speech coding has not led to
significant perceptual improvements. That is not to say that the frequency
domain formulation suggested by the ML procedure is not worthy of
exploitation. On the contrary, it has been crucial to the development of
formant-based low rate systems [3], and the theory can be used as a basis
for analyzing some of the time-domain pitch-adaptive algorithms that have
already been developed [11]. Furthermore, the frequency domain
implementation allows for the possibility of sampling the speech at a
high rate, 10 kHz say, and analyzing the speech spectrum at any other
lower rate simply by altering the cos(ω) table to reflect the desired
folding frequency. This allows for the development of a new class of
voicing-adaptive split band vocoder algorithms, which, for a fixed model
order (10 say), can adjust to a wider bandwidth (5 kHz say) during
unvoiced speech and a lower bandwidth (3.4 kHz say) during voiced speech,
and thereby result in "crisper," more intelligible speech at the same
data rate. Finally, it should be noted that care was taken to derive the
likelihood function in such a way that all pitch-dependent terms were
preserved so that it could be used as a pitch and voicing statistic.
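
The remark about altering the cos(ω) table can be illustrated with a
short sketch. It assumes, as in standard frequency domain linear
prediction, that the autocorrelation lags are obtained from the power
measurements P_n at frequencies f_n as R(k) = sum_n P_n cos(k ω_n), with
ω_n = π f_n / f_fold. Whether the report's implementation is organized in
this way is not stated. All function names are hypothetical.

```python
import math


def cos_table(freqs_hz, folding_hz, max_lag):
    """Rows k = 0..max_lag of cos(k * w_n), with w_n = pi * f_n / folding_hz."""
    return [[math.cos(k * math.pi * f / folding_hz) for f in freqs_hz]
            for k in range(max_lag + 1)]


def autocorrelation(powers, table):
    """R(k) = sum_n P_n cos(k w_n), up to a constant scale factor."""
    return [sum(p * c for p, c in zip(powers, row)) for row in table]


def lags_for_band(freqs_hz, powers, folding_hz, max_lag):
    """Keep measurements below folding_hz and rebuild the table for that band."""
    kept = [(f, p) for f, p in zip(freqs_hz, powers) if f < folding_hz]
    band_freqs = [f for f, _ in kept]
    band_powers = [p for _, p in kept]
    return autocorrelation(band_powers,
                           cos_table(band_freqs, folding_hz, max_lag))


if __name__ == "__main__":
    # Harmonics of a 250 Hz pitch below 5 kHz, with made-up flat powers.
    freqs = [250.0 * n for n in range(1, 20)]
    powers = [1.0] * len(freqs)
    print(lags_for_band(freqs, powers, folding_hz=5000.0, max_lag=10)[:3])
    print(lags_for_band(freqs, powers, folding_hz=3400.0, max_lag=10)[:3])
```

In this sketch the same set of power measurements yields the lags for a
5 kHz or a 3.4 kHz analysis bandwidth. Only the folding frequency passed
to the table changes, and the model order (the number of lags) stays
fixed.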